core: throw away subchannel references after round_robin is shutdown #8132
…own. This avoids receiving balancing state updates after the LB policy is shut down.
It sounds like round_robin shouldn't have been shut down in that case. If the picker is still being used, the LB policy shouldn't be shut down. The policy shutdown will shut down all its subchannels, so a buffering picker is actually pretty apt. The only other option (and not a bad one, except that updating the picker after shutdown should be a noop) is an erroring picker. This change looks quite fair, but not for the reason presented.
This sounds fair, and it is what AutoConfiguredLoadbalancerFactory is doing today: it first replaces the picker and then shuts down the current LB policy. We should change xDS to do this in the same way (today it first shuts down the downstream LB policy and then replaces the picker). But still, if we do not prevent round_robin from triggering balancing state updates after it has been shut down, the Channel's picker would still be swapped by it.
Yes, we should prevent that. Fixing round_robin doesn't "fix" that issue; only fixing all policies and making sure new ones are okay and keeping existing ones from regressing would avoid the need. Much better to address the problem in the caller, since it seems it would be easy.
Right, I totally agree. That's why I was saying "multiple things not working perfectly well, then causes the whole thing being broken". I am going through all callers (or LBs that could potentially run into something like this) to fix problematic or risky usages. This PR would just fix RR. Does this sound good to you?
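The caller-side handling discussed above (replace the picker first, then shut down the old policy, and ignore anything the old policy reports afterwards) can be sketched roughly as follows. This is a minimal, self-contained model, not grpc-java code: `Channel`, `GuardedHelper`, `Parent`, and the string-valued "pickers" are all hypothetical names made up for illustration.

```java
// Hypothetical model of a parent that owns a child LB policy; not the grpc-java API.
class ChildLifecycleSketch {
  /** Stand-in for the Channel: remembers the most recent picker it was given. */
  static class Channel {
    String currentPicker = "initial";

    void updatePicker(String picker) {
      currentPicker = picker;
    }
  }

  /** Helper handed to one child; forwards updates only while that child is current. */
  static class GuardedHelper {
    private final Channel channel;
    private boolean active = true;

    GuardedHelper(Channel channel) {
      this.channel = channel;
    }

    void updateBalancingState(String picker) {
      if (!active) {
        return; // the child was already shut down, so its late updates are dropped
      }
      channel.updatePicker(picker);
    }

    void deactivate() {
      active = false;
    }
  }

  static class Parent {
    private final Channel channel;
    private GuardedHelper currentHelper;

    Parent(Channel channel) {
      this.channel = channel;
    }

    /** Switches to a new child and returns the helper that child should report through. */
    GuardedHelper switchChild(String interimPicker) {
      // 1. Replace the picker first, so the Channel never keeps using the old child's picker.
      channel.updatePicker(interimPicker);
      // 2. Only then retire the old child (a real policy would also call its shutdown here).
      if (currentHelper != null) {
        currentHelper.deactivate();
      }
      currentHelper = new GuardedHelper(channel);
      return currentHelper;
    }
  }

  public static void main(String[] args) {
    Channel channel = new Channel();
    Parent parent = new Parent(channel);
    GuardedHelper oldChild = parent.switchChild("buffering");
    oldChild.updateBalancingState("rr-ready");
    GuardedHelper newChild = parent.switchChild("erroring");
    oldChild.updateBalancingState("rr-buffering"); // late update from the retired child
    System.out.println(channel.currentPicker); // still "erroring"
    newChild.updateBalancingState("new-ready");
    System.out.println(channel.currentPicker); // now "new-ready"
  }
}
```

With a guard like this in the caller, a late update from a retired child cannot swap the Channel's picker, regardless of whether the child policy itself behaves well after shutdown.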
We can make this change to preserve the normal invariant that shutdown/unused subchannels are not present in `subchannels`. But this should not be considered a fix for any problem anybody is noticing.
…rpc#8132) Triggering balancing state updates after round_robin has already been shut down can be confusing for its upstream. When callers do not manage round_robin's lifecycle properly (e.g., they do not ignore updates after shutting down round_robin, which they should), this can make the problem much worse, especially because round_robin actually propagates TRANSIENT_FAILURE with a picker that buffers RPCs. This change only polishes round_robin by always preserving its invariant. Callers/LBs should not rely on this and should still handle balancing updates from their downstream correctly, based on the downstream's lifetime.
…8132) (#8155) Triggering balancing state updates after round_robin has already been shut down can be confusing for its upstream. When callers do not manage round_robin's lifecycle properly (e.g., they do not ignore updates after shutting down round_robin, which they should), this can make the problem much worse, especially because round_robin actually propagates TRANSIENT_FAILURE with a picker that buffers RPCs. This change only polishes round_robin by always preserving its invariant. Callers/LBs should not rely on this and should still handle balancing updates from their downstream correctly, based on the downstream's lifetime.
After a Subchannel is shut down, its state listener receives (after a 5s delay) a connectivity state update with SHUTDOWN state. round_robin picks up that state update and triggers a balancing state update to the upstream with TRANSIENT_FAILURE and an empty picker that buffers RPCs.
This can cause extremely hard-to-debug problems. For example, when round_robin is shut down (e.g., when switching to another LB policy, or in complex cases like xDS where endpoint-level load balancing is turned off because the EDS resource was revoked) while its replacement has not yet produced a picker, the Channel updates the picker to buffer RPCs (instead of continuing to use the current one or failing RPCs).
Note that the same situation is fairly safe in pick_first: any subchannel state updates received after the LB's view of the subchannel's state has become SHUTDOWN are ignored.
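The round_robin behavior described above, and the effect of throwing away subchannel references on shutdown, can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the actual grpc-java implementation: `RoundRobinModel`, `RecordingHelper`, the `clearOnShutdown` flag, and the address string are hypothetical, and the 5s delay is not modeled (the late SHUTDOWN notification is simply delivered after `shutdown()`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model. None of these types are the real grpc-java classes.
class RoundRobinShutdownSketch {
  enum State { READY, TRANSIENT_FAILURE, SHUTDOWN }

  /** Stand-in for the upstream helper; records every balancing state update. */
  static class RecordingHelper {
    final List<State> updates = new ArrayList<>();

    void updateBalancingState(State state) {
      updates.add(state);
    }
  }

  static class Subchannel {
    final String address;
    State state = State.TRANSIENT_FAILURE;

    Subchannel(String address) {
      this.address = address;
    }
  }

  static class RoundRobinModel {
    private final RecordingHelper helper;
    private final boolean clearOnShutdown;
    private final Map<String, Subchannel> subchannels = new HashMap<>();

    RoundRobinModel(RecordingHelper helper, boolean clearOnShutdown) {
      this.helper = helper;
      this.clearOnShutdown = clearOnShutdown;
    }

    Subchannel addSubchannel(String address) {
      Subchannel sc = new Subchannel(address);
      subchannels.put(address, sc);
      return sc;
    }

    /** Called by the (modeled) subchannel state listener. */
    void onSubchannelState(Subchannel sc, State newState) {
      if (subchannels.get(sc.address) != sc) {
        return; // not a subchannel this policy still tracks
      }
      sc.state = newState;
      boolean anyReady = false;
      for (Subchannel s : subchannels.values()) {
        anyReady |= s.state == State.READY;
      }
      // With no READY subchannel, the model reports TRANSIENT_FAILURE upstream.
      helper.updateBalancingState(anyReady ? State.READY : State.TRANSIENT_FAILURE);
    }

    void shutdown() {
      // A real policy would also shut down each subchannel here.
      if (clearOnShutdown) {
        subchannels.clear(); // drop references so late notifications are ignored
      }
    }
  }

  public static void main(String[] args) {
    for (boolean clear : new boolean[] {false, true}) {
      RecordingHelper helper = new RecordingHelper();
      RoundRobinModel rr = new RoundRobinModel(helper, clear);
      Subchannel sc = rr.addSubchannel("10.0.0.1:443");
      rr.onSubchannelState(sc, State.READY);
      rr.shutdown();
      rr.onSubchannelState(sc, State.SHUTDOWN); // late notification after shutdown
      System.out.println("clearOnShutdown=" + clear + " updates=" + helper.updates);
    }
  }
}
```

Without clearing, the model records `[READY, TRANSIENT_FAILURE]`: the late SHUTDOWN notification still reaches the upstream. With clearing, it records only `[READY]`.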